United States and India are the two countries that produce the most content on Netflix

Count Of Content By Country
country n
1 United States 3297
2 India 990
3 United Kingdom 723
5 Canada 412
6 France 349
7 Japan 287
8 Spain 215
9 South Korea 212
10 Germany 199
11 Mexico 154
12 China 147
13 Australia 144
14 Egypt 110
15 Turkey 108
16 Hong Kong 102
17 Italy 90
18 Brazil 88
19 Belgium 85
20 Taiwan 85
21 Argentina 82
22 Indonesia 80
23 Philippines 78
24 Nigeria 76
25 Thailand 65
26 South Africa 54
27 Colombia 45
28 Netherlands 45
29 Denmark 44
30 Ireland 40
31 Singapore 39
32 Sweden 39
33 Poland 36
34 United Arab Emirates 34
35 Norway 29
36 New Zealand 28
37 Russia 27
38 Chile 26
39 Israel 26
40 Lebanon 26
41 Malaysia 26
42 Pakistan 24
43 Czech Republic 20
44 Switzerland 17
45 Uruguay 14
46 Romania 12
47 Austria 11
48 Finland 11
49 Luxembourg 11
50 Greece 10
51 Peru 10
52 Saudi Arabia 10
53 Bulgaria 9
54 Hungary 9
55 Iceland 9
56 Jordan 8
57 Kuwait 7
58 Qatar 7
59 Serbia 7
60 Morocco 6
61 Cambodia 5
62 Kenya 5
63 Vietnam 5
64 West Germany 5
65 Croatia 4
66 Ghana 4
67 Iran 4
68 Portugal 4
69 Bangladesh 3
70 Malta 3
71 Senegal 3
72 Slovenia 3
73 Soviet Union 3
74 Ukraine 3
75 Venezuela 3
76 Zimbabwe 3
77 Algeria 2
78 Cayman Islands 2
79 Georgia 2
80 Guatemala 2
81 Iraq 2
82 Namibia 2
83 Nepal 2
84 Afghanistan 1
85 Albania 1
86 Angola 1
87 Armenia 1
88 Azerbaijan 1
89 Bahamas 1
90 Belarus 1
91 Bermuda 1
92 Botswana 1
93 Cuba 1
94 Cyprus 1
95 Dominican Republic 1
96 East Germany 1
97 Ecuador 1
98 Jamaica 1
99 Kazakhstan 1
100 Latvia 1
101 Liechtenstein 1
102 Lithuania 1
103 Malawi 1
104 Mauritius 1
105 Mongolia 1
106 Montenegro 1
107 Nicaragua 1
108 Panama 1
109 Paraguay 1
110 Puerto Rico 1
111 Samoa 1
112 Slovakia 1
113 Somalia 1
114 Sri Lanka 1
115 Sudan 1
116 Syria 1
117 Uganda 1
118 Vatican City 1
Count Of Genre Content Produced In US & India
country listed_in n
India International Movies 811
India Dramas 611
United States Dramas 584
United States Comedies 525
United States Documentaries 421
United States Independent Movies 310
United States Children & Family Movies 303
India Comedies 300
United States Action & Adventure 245
United States TV Comedies 228
United States Stand-Up Comedy 212
United States Thrillers 205
United States TV Dramas 196
United States Kids’ TV 170
United States Docuseries 169
United States Romantic Movies 166
United States Horror Movies 152
India Independent Movies 140
India Action & Adventure 127
United States Sci-Fi & Fantasy 127
United States Crime TV Shows 114
India Romantic Movies 113
United States Music & Musicals 109
United States Reality TV 109
India Music & Musicals 94
United States Sports Movies 94
India Thrillers 88
United States TV Action & Adventure 77
India International TV Shows 60
United States Classic Movies 59
United States LGBTQ Movies 54
United States TV Sci-Fi & Fantasy 52
United States TV Mysteries 42
United States Science & Nature TV 41
United States International Movies 38
United States International TV Shows 38
United States Cult Movies 37
United States Romantic TV Shows 35
India Horror Movies 33
United States Stand-Up Comedy & Talk Shows 33
United States Faith & Spirituality 30
United States TV Horror 30
United States Teen TV Shows 29
India TV Comedies 25
India TV Dramas 25
United States Movies 22
United States TV Thrillers 22
India Children & Family Movies 19
India Documentaries 19
United States Classic & Cult TV 16
United States Spanish-Language TV Shows 16
India Sports Movies 15
India Classic Movies 11
India Kids’ TV 11
United States Anime Series 11
India Sci-Fi & Fantasy 10
India Crime TV Shows 9
India Romantic TV Shows 9
United States British TV Shows 9
India Docuseries 7
India TV Horror 7
India Stand-Up Comedy 6
India Cult Movies 5
India TV Action & Adventure 5
India Faith & Spirituality 3
India Reality TV 3
India Stand-Up Comedy & Talk Shows 3
India TV Mysteries 3
India TV Sci-Fi & Fantasy 3
India TV Thrillers 3
United States TV Shows 3
India LGBTQ Movies 2
India TV Shows 2
United States Anime Features 2
United States Korean TV Shows 2
India British TV Shows 1
India Teen TV Shows 1

---
title: "Analyzing Netflix Content Produced in USA and India"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bg: "#ffffff"
      fg: "#101010" 
    orientation: columns
    storyboard: true
    social: menu
    source: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(NetflixData)
library(tidyverse) # attaches dplyr and ggplot2
library(ggthemes)
library(plotly)
library(kableExtra)
library(e1071)
library(mlbench)
library(bslib)
library(ggridges)

# Text mining packages
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(thematic)
```


### United States and India are the two countries that produce the most content on Netflix

```{r}
c <- country_data %>%
  count(country) %>%
  arrange(desc(n)) %>%
  na.omit() 

c %>%
  kbl(caption = "Count Of Content By Country") %>%
  kable_material_dark("striped") %>%
  scroll_box(width = "825px", height = "250px")
```


```{r}
c2 <- country_listedin %>%
  group_by(country) %>%
  filter(country %in% c("United States", "India")) %>%
  count(listed_in) %>%
  arrange(desc(n)) %>%
  na.omit()

c2 %>%
  kbl(caption = "Count Of Genre Content Produced In US & India") %>%
  kable_material_dark("striped") %>%
  scroll_box(width = "825px", height = "250px")
```

***

- The first table is sorted from the countries that produce the *most* to the *least* Movies and TV Shows available on **Netflix** *since 2019*

- **3297** Movies and TV Shows available on Netflix were produced in the **United States**
- **990** Movies and TV Shows available on Netflix were produced in **India**

- What kinds of questions can we answer using this dataset/package?

- How does the duration of movies differ between these countries?

- How do the Movies and TV Shows produced in these countries differ?

- Which genres are popular in the United States? Which genres are popular in India?

- What is the maturity rating distribution like for these two countries?

- Dramas is a popular genre in both the United States and India. How do the descriptions of films in this genre differ between the two countries?


### Duration Distribution of USA and Indian Produced Movies (all genres)

```{r}
india_duration <- dplyr::filter(netflix, grepl('India', country))
india_duration <- dplyr::filter(india_duration, grepl('Movie', type))

usa_duration <- dplyr::filter(netflix, grepl('United States', country))
usa_duration <- dplyr::filter(usa_duration, grepl('Movie', type))

fig_dur <- plot_ly(y = ~india_duration$duration, color = I("red"), type = "box", name = "India")
fig_dur <- fig_dur %>% add_trace(y = ~usa_duration$duration, color = I("black"), name = "USA")
fig_dur <- fig_dur %>% 
  layout(title = "India vs USA Duration of Movies", 
        yaxis = list(title = "Duration In Minutes"))

fig_dur
```

***

- This boxplot shows the distribution of the duration of Movies produced in India and the United States, *regardless* of genre
- The majority of Indian-produced movies are Bollywood productions, and Bollywood films tend to run longer than US-produced films

- The median length of Indian-produced movies available on Netflix is over 2 hours, while US-produced movies run around 1.5 hours (1 hour 30 minutes)
  - **India:** The interquartile range runs from *1 hour and 43 minutes* to *2 hours and 21 minutes*
  - **USA:** The interquartile range runs from *1 hour and 21 minutes* to *1 hour and 45 minutes*
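
The median and quartiles quoted here can be read from the boxplot hover text, or computed directly. The following is a minimal sketch, assuming `duration` holds numeric minutes. The `durations` values are made up for illustration, and `fmt_hm()` is a hypothetical helper that is not part of NetflixData.

```r
# Hypothetical movie durations in minutes (illustrative values only)
durations <- c(81, 90, 95, 103, 121, 127, 141, 150)

# Hypothetical helper: turn a number of minutes into an "H h M min" label
fmt_hm <- function(minutes) {
  minutes <- as.integer(round(minutes))
  sprintf("%d h %d min", minutes %/% 60L, minutes %% 60L)
}

# First quartile, median, and third quartile of the durations
q <- quantile(durations, probs = c(0.25, 0.5, 0.75), names = FALSE)
summary_labels <- fmt_hm(q)
summary_labels
```

The quartile convention used by plotly's boxplot may differ slightly from the default of `quantile()`, so small discrepancies against the hover text are possible.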


### USA and India: taking a closer look at genre

```{r}
usa_listedin <- country_listedin %>%
  filter(country == "United States") %>%
  count(listed_in) %>%
  arrange(listed_in) %>%
  na.omit()

india_listedin <- country_listedin %>%
  filter(country == "India") %>%
  count(listed_in) %>%
  arrange(listed_in) %>%
  na.omit()

fig <- plot_ly(india_listedin, x = ~listed_in, y = ~n, type = 'bar', name = 'India', color = I("red"))
# Pass the USA counts through the named `data` argument so the formulas resolve against them
fig <- fig %>% add_trace(data = usa_listedin, x = ~listed_in, y = ~n, name = 'USA', color = I("black"))
fig <- fig %>% layout(xaxis = list(title = 'Genre'), yaxis = list(title = 'Count'), barmode = 'stack')
fig
```

***

- Setting aside the *International Movies* genre, which Movies and TV Shows produced in India typically receive because of where they were made, both the United States and India produce a significant number of movies in the Dramas genre

- **584 movies** produced in the United States are labeled **Dramas**
- **611 movies** produced in India are labeled **Dramas**
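
The step of omitting *International Movies* and then picking the leading genre can be sketched as follows. The counts are copied from the first rows of the "Count Of Genre Content Produced In US & India" table. The `genre_counts` data frame is a hand-built stand-in for `country_listedin`, and base R is used so the sketch has no package dependencies.

```r
# Counts copied from the genre table for India and the United States
genre_counts <- data.frame(
  country = c("India", "India", "India",
              "United States", "United States", "United States"),
  listed_in = c("International Movies", "Dramas", "Comedies",
                "Dramas", "Comedies", "Documentaries"),
  n = c(811, 611, 300, 584, 525, 421),
  stringsAsFactors = FALSE
)

# Drop the location-driven "International Movies" label
filtered <- genre_counts[genre_counts$listed_in != "International Movies", ]

# For each country, keep the row with the largest count
top_genre <- do.call(rbind, lapply(split(filtered, filtered$country), function(d) {
  d[which.max(d$n), ]
}))
top_genre
```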


### USA and India: Taking a closer look at maturity rating proportion distribution (regardless of genre)

```{r}
usa_countrydata <- country_data %>%
  filter(country == "United States") %>%
  group_by(country) %>%
  count(rating) %>%
  arrange(rating) %>%
  na.omit()

india_ratingdata <- country_data %>%
  filter(country == "India") %>%
  group_by(country) %>%
  count(rating) %>%
  arrange(rating) %>%
  na.omit()

india_rating_plot <- plot_ly(india_ratingdata, x = ~rating, y = ~n, type = 'bar', name = 'India', color = I("red"))
# Set the trace type explicitly so both panels are drawn as bar charts
usa_rating_plot <- plot_ly(usa_countrydata, x = ~rating, y = ~n, type = 'bar', name = 'USA', color = I("black"))
subplot(india_rating_plot, usa_rating_plot)
```

***

- From my understanding, and from the movies I've watched, India typically produces more family-friendly movies, so it makes sense that its most common rating is TV-14

- The United States typically produces more mature Movies and TV Shows, but it also has a wider variety of content types than India. This may be because Netflix was founded in the US, which steers the available content towards US-produced titles
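
The section title mentions proportions, while the bars show raw counts. Because the United States has roughly three times as many titles as India, within-country shares are easier to compare. The following is a minimal sketch of that conversion in base R. The `rating_counts` values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical rating counts per country (illustrative values only)
rating_counts <- data.frame(
  country = c("India", "India", "India",
              "United States", "United States", "United States"),
  rating = c("TV-14", "TV-MA", "TV-PG", "TV-14", "TV-MA", "TV-PG"),
  n = c(50, 30, 20, 100, 250, 50)
)

# Divide each count by its country's total to get a within-country share
country_total <- ave(rating_counts$n, rating_counts$country, FUN = sum)
rating_counts$prop <- rating_counts$n / country_total
rating_counts
```

Plotting `prop` instead of `n` would put both countries on the same 0 to 1 scale.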

### Most Frequent Description Words for Dramas Produced In India

```{r}
# Build a word-frequency table from the descriptions of Dramas for one country
word_freq <- function(titles, country_pattern) {
  dramas <- titles %>%
    dplyr::filter(grepl("Dramas", listed_in), grepl(country_pattern, country))

  corpus <- Corpus(VectorSource(dramas$description))
  # Wrap base functions in content_transformer() so the corpus structure is preserved
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, c("cloth", stopwords("english")))
  corpus <- tm_map(corpus, stripWhitespace)

  # Sum each term's count across all descriptions, most frequent first
  v <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus))), decreasing = TRUE)
  data.frame(word = names(v), freq = v)
}

india_data <- word_freq(netflix, "India")
usa_data <- word_freq(netflix, "United States")
```

```{r}
set.seed(4)
barplot(india_data[1:10,]$freq, las = 2, names.arg = india_data[1:10,]$word,
        col ="red", main ="Most Frequent Description Words for Dramas In India",
        ylab = "Word frequencies")
# The Dark2 palette provides at most 8 colors
w <- wordcloud(words = india_data$word, freq = india_data$freq, min.freq = 10,
          max.words=25, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
```

***

- Most Bollywood movies gear towards romance, so I am not surprised that "love" is a popular description word


### Most Frequent Description Words for Dramas Produced In USA

```{r}
set.seed(4)
barplot(usa_data[1:10,]$freq, las = 2, names.arg = usa_data[1:10,]$word,
        col ="black", main ="Most Frequent Description Words for Dramas In USA",
        ylab = "Word frequencies")
# The Dark2 palette provides at most 8 colors
w2 <- wordcloud(words = usa_data$word, freq = usa_data$freq, min.freq = 10,
          max.words=25, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
```

***

- The word "young" is the most frequent description word for both US and Indian drama movies
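
A claim like this can be checked by intersecting the top words of the two countries. The following is a minimal sketch in base R rather than tm. The descriptions are made up for illustration, `top_words()` is a hypothetical helper, and its tiny stop word list is a stand-in for `stopwords("english")`.

```r
# Hypothetical descriptions (illustrative text only)
india_desc <- c("A young man falls in love.",
                "A young woman fights for her family.")
usa_desc <- c("A young lawyer takes a case.",
              "A family faces a young rival.")

# Hypothetical helper: the n most frequent words after basic cleaning
top_words <- function(descriptions, n = 3,
                      stop_words = c("a", "in", "for", "her", "the")) {
  cleaned <- tolower(gsub("[[:punct:]]", "", descriptions))
  words <- unlist(strsplit(cleaned, "\\s+"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  freq <- sort(table(words), decreasing = TRUE)
  names(freq)[seq_len(min(n, length(freq)))]
}

# Words that appear among the top words of both countries
shared <- intersect(top_words(india_desc), top_words(usa_desc))
shared
```

With the real data, the same comparison would be `intersect(india_data$word[1:10], usa_data$word[1:10])`.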

### Duration Distribution of USA and Indian Produced Drama Movies

```{r}
india_duration_drama <- dplyr::filter(netflix, grepl('India', country))
india_duration_drama <- dplyr::filter(india_duration_drama, grepl('Dramas', listed_in))
india_duration_drama <- dplyr::filter(india_duration_drama, grepl('Movie', type))

usa_duration_drama <- dplyr::filter(netflix, grepl('United States', country))
usa_duration_drama <- dplyr::filter(usa_duration_drama, grepl('Dramas', listed_in))
usa_duration_drama <- dplyr::filter(usa_duration_drama, grepl('Movie', type))

fig_dur_drama <- plot_ly(y = ~india_duration_drama$duration, color = I("red"), type = "box", name = "India")
fig_dur_drama <- fig_dur_drama %>% add_trace(y = ~usa_duration_drama$duration, color = I("black"), name = "USA")
fig_dur_drama <- fig_dur_drama %>% 
  layout(title = "India VS USA Duration Of Drama Movies", 
         yaxis = list(title = "Duration In Minutes"))

fig_dur_drama

```

***

- This is a second boxplot of the duration of movies produced in India and the USA, this time restricted to the Dramas genre. The distributions turned out to be similar


### Duration Across Maturity Ratings

```{r}
india_duration2 <- dplyr::filter(netflix, grepl('India', country))
india_duration2 <- dplyr::filter(india_duration2, grepl('Movie', type))

indiadensity <- ggplot(india_duration2, 
       aes(x = duration, 
           y = rating, 
           fill = rating)) +
  geom_density_ridges() + 
  theme_few() +
  labs(title = "India Movie Duration Distribution", x = "Duration in Minutes", y = "Maturity Rating") +
  theme(legend.position = "none")

usa_duration2 <- dplyr::filter(netflix, grepl('United States', country))
usa_duration2 <- dplyr::filter(usa_duration2, grepl('Movie', type))

usadensity <- ggplot(usa_duration2, 
       aes(x = duration, 
           y = rating, 
           fill = rating)) +
  geom_density_ridges() + 
  theme_few() +
  labs(title = "United States Movie Duration Distribution", x = "Duration in Minutes", y = "Maturity Rating") +
  theme(legend.position = "none")

# Stack the two ridgeline plots; the gridExtra:: prefix avoids attaching the package
gridExtra::grid.arrange(indiadensity, usadensity)
```

***

- This is a quick ridgeline density plot of movie duration by maturity rating for each country, drawn with `geom_density_ridges()` from ggridges
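
Ridgelines are compared by eye, so a numeric summary per rating can complement them. The following is a minimal sketch using base R's `aggregate()`. The `movies` data frame is made up for illustration and stands in for the filtered `netflix` data.

```r
# Hypothetical movies (illustrative values only)
movies <- data.frame(
  rating = c("TV-14", "TV-14", "TV-14", "TV-MA", "TV-MA", "TV-PG"),
  duration = c(120, 140, 160, 90, 110, 100)
)

# Median duration in minutes for each maturity rating
median_by_rating <- aggregate(duration ~ rating, data = movies, FUN = median)
median_by_rating
```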